Close Copy Speech Synthesis for Speech Perception Testing

Authors

  • Jolanta Bachan
  • Dafydd Gibbon
Abstract

The present study is concerned with developing a speech synthesis subcomponent for perception testing in the context of evaluating cochlear implants in children. We provide a detailed requirements analysis and develop a strategy for maximally high-quality speech synthesis using Close Copy Speech synthesis techniques with a diphone-based speech synthesiser, MBROLA. The close copy concept used in this work defines close copy as a function from a pair consisting of a speech signal recording and a phonemic annotation aligned with the recording to the pronunciation specification interface of the speech synthesiser. The design procedure has three phases: Manual Close Copy Speech (MCCS) synthesis as a "best case gold standard", in which the function is implemented manually as a preliminary step; Automatic Close Copy Speech (ACCS) synthesis, in which the steps taken in the manual transformation are emulated by software; and finally Parametric Close Copy Speech (PCCS) synthesis, in which prosodic parameters are modifiable while the diphones are retained. This contribution reports on the MCCS and ACCS synthesis phases.

1 Objectives and context for Close Copy Speech synthesis development

1.1 Objectives and procedure

The aim of this study is, first, to develop a restricted-domain speech synthesis concept for automatically generating acoustic stimuli for use in evaluating cochlear implants for children and, second, to implement a prototype synthesiser. The main motivation for including a speech synthesiser in the system is to increase the flexibility of the available test stimuli. The basis for the synthesiser is the Close Copy Speech (CCS) synthesis or resynthesis method, in which it is the task of the synthesiser to "repeat utterances produced by a human speaker with a synthetic voice, while keeping the original prosody" (Dutoit 1997). In this method, "close copy" means that the synthetic speech is as similar as possible to a human utterance. In fact, in the present context, "copy" means that the input to the synthesis engine for a given utterance is derived directly from a corresponding utterance in the annotated corpus data.¹ The method can be taken a step further by parametrising the prosody, so that modifications of the original prosody (speech timing and pitch patterns) can also be systematically introduced.

For the purposes of this study, MBROLA, a de facto standard diphone synthesis engine with a suitably modular language-to-speech interface, was selected (Dutoit 1997). In the present study, the definition by Dutoit is interpreted to mean that the Natural Language Processing or Text-To-Speech (TTS) component of the synthesiser is replaced by an analysis of a recorded speech signal. The analysis in the present context consists of a recorded speech signal, a method for pitch extraction from the speech signal, and an aligned phonemic annotation of the speech signal.

¹ The present development project is part of the Cochlear Implant Testing project led by Grażyna Demenko, and of the M.A. thesis of Jolanta Bachan under the supervision of Grażyna Demenko and Dafydd Gibbon. Special thanks are due to Thorsten Trippel for the initial BLF2TextGrid conversion via the TASX XML format, and to Arne Hellmich for suggestions for the BLF2MBROLA conversion.
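To make the pronunciation specification interface concrete: MBROLA takes a ".pho" file in which each line gives a phoneme symbol, its duration in milliseconds, and optionally a series of (position %, F0 Hz) pitch targets. The following sketch is an illustration rather than the project's actual BLF2MBROLA tool; it shows how an aligned annotation and an extracted pitch contour might be mapped onto this interface. The input file layouts, file names and helper functions are assumptions made only for this example.

    # Sketch only: map an aligned phonemic annotation and an F0 contour onto
    # MBROLA's .pho pronunciation specification. The simple label and pitch
    # file layouts assumed here are illustrative, not the project's BLF format.

    def load_segments(path):
        """Each line: <phoneme> <start_seconds> <end_seconds>.
        Phoneme labels are assumed to use the SAMPA set of the target voice."""
        segments = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                label, start, end = line.split()
                segments.append((label, float(start), float(end)))
        return segments

    def load_pitch(path):
        """Each line: <time_seconds> <f0_hz>, as produced by a pitch extractor."""
        points = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                t, f0 = line.split()
                points.append((float(t), float(f0)))
        return points

    def f0_at(points, t):
        """Nearest-neighbour F0 lookup; a real tool would interpolate and
        treat unvoiced stretches explicitly."""
        return min(points, key=lambda p: abs(p[0] - t))[1]

    def to_pho(segments, points, targets_per_phone=2):
        """Emit one .pho line per phoneme:
        <symbol> <duration_ms> [<position_%> <F0_Hz>] ..."""
        lines = []
        for label, start, end in segments:
            dur_ms = round((end - start) * 1000)
            fields = [label, str(dur_ms)]
            for i in range(targets_per_phone):
                rel = (i + 1) / (targets_per_phone + 1)   # e.g. 33 % and 67 %
                f0 = f0_at(points, start + rel * (end - start))
                fields += [str(round(rel * 100)), str(round(f0))]
            lines.append(" ".join(fields))
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        segments = load_segments("utterance.lab")    # hypothetical annotation file
        pitch = load_pitch("utterance.f0")           # hypothetical pitch track
        with open("utterance.pho", "w", encoding="utf-8") as fh:
            fh.write(to_pho(segments, pitch))
        # The .pho file can then be synthesised with an MBROLA voice, e.g.:
        #   mbrola pl1 utterance.pho utterance.wav

In the MCCS phase this mapping is carried out entirely by hand as the "best case gold standard"; the ACCS phase replaces the manual steps with conversions of roughly this kind; and the PCCS phase would additionally expose the duration and pitch values for systematic modification.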
The development procedure used in this study has three phases:

1. Manual Close Copy Speech (MCCS) synthesis: manual transfer of parameters from the original signals and annotations to the synthesiser interface.
2. Automatic Close Copy Speech (ACCS) synthesis: automatic transfer of parameters from the original signals and annotations, based on specifications derived from the MCCS phase.
3. Parametric Close Copy Speech (PCCS) synthesis: interactive and automatic parametrisation at the ACCS-derived synthesiser interface.

This paper reports on the background to the development and on the MCCS and ACCS synthesis phases of the development.

1.2 Context of the TTS development

The context of the present development is a project for testing the functionality of cochlear implants in children. The project strategy involves the development of software-supported tests and the administration of the tests to normally hearing children, to children with hearing aids and to children with cochlear implants. An overview of the context is shown in Figure 1; the individual components are needed for defining the use cases and the use-case-based requirements for the speech synthesiser.

Figure 1: Project context for TTS software development

1.3 Overview of the paper

The present study is concerned with developing a Close Copy Speech synthesis subcomponent for component #2 shown in Figure 1. Evaluation feedback is expected from all other components. Components #3, #4, #5, #6 and #7 serve to define use cases for deployment of the TTS software; the main use cases considered are #3, test presentation development, and #4, test administration. The paper deals, first, with the project requirements and use cases which feed into the CCS synthesiser development; second, with the system requirements; third, with the MCCS development phase; fourth, with the ACCS development phase; and finally with a conclusion concerning the application and evaluation procedures.

2 Requirements: use cases

2.1 Use case: Test presentation development (component #3)

The battery of speech perception tests for children with a cochlear implant was created at Adam Mickiewicz University. Linguists, phoneticians, graphic designers and computer programmers were involved in the project. The tests were designed in close cooperation with experts from the Medical Academy and with audiologists from the Marke-med centre, both in Poznań. The tools for administering these tests contain two types of speech perception test:

1. Nonsense tests: tests with nonsense stimuli. Some of the tests in this set make use of synthesised stimuli. The aim of these tests is to assess whether the subject is able to take the verbal tests.
2. Verbal tests: tests with verbal stimuli.

Both sets of tests examine children's perceptive and linguistic skills using acoustic signals only. There are no visual cues in the test procedure, so the subject cannot lip-read. In both kinds of test the subject answers by pointing at a picture on a computer screen. The tests were designed for young children, and touch screens were provided for children who did not know how to use a computer mouse. The tests with verbal stimuli are designed for children who are able to comprehend speech but who may be unable to give verbal responses. In these tests six different voices were used to test the intelligibility of different voice pitches: two male adults, two female adults, one male child and one female child.

The results of the first series of tests in this use case indicated that more flexibility would be provided by more extensive use of a speech synthesiser of higher quality than was currently available. This result provided part of the motivation for the development of a CCS synthesis system.

2.2 Use case: Test administration by perception testers (component #4)

The perception tests are designed for use by audiologists and speech therapists. They can be used by the audiologist in programming the cochlear implant, or by the speech therapist as an achievement test. The set of speech perception tests is also useful teaching material, and it can be used by parents to help their children work on their perceptive skills. The standard graphical user interface will need to be extended with manipulation options for synthesised voices.

Figure 2 shows the test scenario. During the testing procedure three parties are involved: the child, the tester and the computer; a parent's presence during the tests is optional. In the first stage of the testing procedure the tester provides the subject with instructions. If the subject understands the instructions, the tester runs the tests and the testing material appears on the computer screen; if the subject cannot understand the instructions, the test is terminated. The computer provides acoustic stimuli for the child, the tester and (if present) the parent. The child then responds to the stimuli by pointing at a picture visualising the acoustic stimulus. If the child does not know what the stimulus is, he or she asks the tester or the parent. In principle the tester is not allowed to give hints, but for the purpose of this preliminary research (evaluating the tests) the testers may help the children with the tests if necessary. Similarly, although parents are in principle also not allowed to help, it is understandable that they help their young children with answers when asked, and this is currently permitted. This kind of cooperation between the child, the tester and/or the parent is one of the main complicating factors in assessing the structure of the tests and the dialogue between the child and the computer. All the responses given by the child to the computer are collected, and the results of the test are available on the computer screen to the tester. Finally, the tester notes down the results for future processing.

Figure 2: Test scenario showing communication relations between child, computer, tester and parent

2.3 Use case: Test evaluation (components #5, #6)

The set of speech perception tests was evaluated by students of linguistics at Adam Mickiewicz University. The evaluation of the tests started in September 2005 and proceeded as follows:

1. In September and October 2005 the preliminary version of the set of verbal tests and the set of tests with nonsense stimuli was verified on Polish children with normal hearing. The two sets of tests were administered to 19 five-year-olds and 18 six-year-olds. The children's hearing was examined by audiologists; all the children had normal hearing and were normally developed.
2. In May and June 2006 the corrected and completed version of the set of verbal tests was verified on Polish children with normal hearing. 14 four-year-olds, 21 five-year-olds and 22 six-year-olds took part in the verification. The children's hearing was examined beforehand by audiologists. All the children, except one four-year-old, had normal hearing and were normally developed.
3. In June and July 2006 the set of verbal tests was verified on children with hearing aids and children with cochlear implants:
   1. Two Polish children with hearing aids sat some of the tests. One of the children was seven years old, the other was twelve years old.
   2. A group of 15 Polish children with a cochlear implant took some of the tests. The children were of different ages; the youngest were 2.5 years old, the oldest were 11 years old. All but one of the children were prelingually hearing-impaired; one girl had lost her hearing at the age of five, after having acquired a good command of speech.

Results in these scenarios can be compared in order to determine which manipulations of prosodic parameters lead to the best test results. The effectiveness of the set of speech perception tests was evaluated qualitatively by fourth-year students of linguistics. In parallel, the tests were evaluated by audiologists. Note that the testers were concerned with evaluating the perception tests, not the actual cochlear implants. The focus of the research was on evaluating the efficiency, ergonomics, motivation and suitability of the tests for the subject. The testers evaluated many parameters; the parameters relevant for CCS development are as follows:

1. the intelligibility of the instruction, picture and sound combinations used in the tests,
2. the dialogue between the child and the computer.

The problems discovered were:

1. Tests with nonsense stimuli:
   1. The synthesised stimuli in the set of tests with nonsense stimuli were of poor quality.
   2. The children had problems understanding the instructions to the tests with nonsense stimuli.
2. Tests with verbal stimuli:
   1. Some sounds were very difficult to recognise because of the speaker's fast speech rate.
   2. The pitch of the female voice was too low.
   3. The accentuation was not prominent enough for the purpose of some tests.
   4. Some sounds were segmented incorrectly.
   5. Some sounds were missing.
   6. The dialogue between the testee and the computer needs improvement. Children sometimes did not know whether they had given a correct answer or not. They also looked at the testers or the parents for a sign of confirmation before giving an answer.
   7. If children with a cochlear implant could not understand the stimuli, they wanted to read the word from the testers' or their parents' lips.
   8. There is no test including stimuli presented in noise.

For discussion of these results, see Bachan (2006). The results provided a rather specific set of requirements for CCS development.

2.4 Use case: Software evaluation (component #7)

The task for the software evaluation use case is to coordinate evaluation results from the other components in the form of recommendations to the software developers. In practice, evaluation results may go directly to the software developer, but in the ideal case the software evaluator will relate the evaluations to the original project goals before proposing software revisions and further development. Based on the original project goals, some future directions for software development emerged:



